Skip to content

#712 Read Variant metadata offset-size from the correct header bits#716

Merged
gunnarmorling merged 1 commit into
mainfrom
712-variant-metadata-offset-size
Jun 27, 2026
Merged

#712 Read Variant metadata offset-size from the correct header bits#716
gunnarmorling merged 1 commit into
mainfrom
712-variant-metadata-offset-size

Conversation

@gunnarmorling

Copy link
Copy Markdown
Collaborator

Fixes #712.

Problem

The Variant metadata header's offset_size_minus_one field lives in bits 6-7, but VariantBinary read it from bits 5-6 (METADATA_OFFSET_SIZE_SHIFT = 5). Every existing fixture uses offset_size = 1, where both readings yield 0, so the bug was invisible across the whole suite (125 byte-exact shredded-variant cases + the comparison sweep). Any Variant whose dictionary string section exceeds 255 bytes — needing offset_size >= 2 — was misparsed into garbage.

It surfaced via the negative_dictionary_size fixture from apache/parquet-testing#113, which uses offset_size = 4.

Fix

  • METADATA_OFFSET_SIZE_SHIFT 5 → 6 (+ corrected the layout comment).
  • Guard against a 4-byte dictionary_size that reads back as a negative int: reject with a clear "not a valid unsigned int" message rather than letting the bogus size drive later arithmetic. (The (dictSize+1)*offsetSize overflow widening is intentionally left to Unguarded 32-bit overflow in Variant object/array offset arithmetic #713 to avoid overlap.)

Tests

  • Real end-to-end fixture via tools/simple-datagen.py: variant_metadata_offset_size2.parquet — a VARIANT object whose metadata dictionary is 320 bytes, forcing offset_size = 2. VariantLogicalTypeTest.readsVariantWhoseMetadataUsesOffsetSizeTwo reads it through the VARIANT row API and resolves both long-named fields ('a'*160 → INT8 5, 'b'*160 → BOOLEAN_TRUE).
  • VariantMetadataTest.negativeDictionarySizeRejected for the guard.
  • Both new tests fail against the pre-fix shift (verified) and pass with the fix.
  • No regressions: the 125 shredded-variant byte-exact tests plus the variant decoder/encoder, row-API, and conversion tests are green.

Notes

// bit 4: sorted_strings flag
// bit 5-6: offset_size_minus_one (0..3 → 1..4 bytes)
// bit 7: unused
// bit 5: unused

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noice noice, i was confused as it's not specified in the docs, but found the correction here apache/parquet-format#574

@iifawzi iifawzi left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, nice catch, seems like it was gunnar vs variant today, seeing a couple of bugs!

@gunnarmorling

Copy link
Copy Markdown
Collaborator Author

Hehe, yeah; #711 has the overview. This was triggered by a mail on parquet-dev where I learned about a pending addition to parquet-testing for a range of malformed VARIANT files (context). One of the fixtures revealed that parsing bug here.

@gunnarmorling gunnarmorling force-pushed the 712-variant-metadata-offset-size branch 3 times, most recently from ca13218 to ef422dc Compare June 27, 2026 19:04
The metadata header's offset_size_minus_one field lives in bits 6-7, but it was
read from bits 5-6. Every existing fixture uses offset_size=1, where both
readings yield 0, so the bug stayed invisible across the whole suite; any
Variant whose dictionary string section exceeds 255 bytes (needing
offset_size >= 2) was misparsed into garbage.

Also guard a 4-byte dictionary_size that reads back as a negative int, and
enrich variant decode errors raised on the top-level read path with the
originating [fileName] prefix so they are attributable like other read errors.

Regression coverage uses a real file generated via simple-datagen (a 320-byte
dictionary needing offset_size=2) read end-to-end through the VARIANT row API,
plus a corrupt-metadata fixture asserting the enriched message. The bit-layout
comment and class @see cite the rendered Variant spec, and a stale
PqVariantObjectImpl field-lookup javadoc is corrected.
@gunnarmorling gunnarmorling force-pushed the 712-variant-metadata-offset-size branch from f316c79 to 0cb363f Compare June 27, 2026 19:35
@gunnarmorling gunnarmorling merged commit 0235776 into main Jun 27, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Variant metadata offset-size read off by one bit (bits 6-7, not 5-6)

2 participants